Abstract



Residue-wise contact order (RWCO) is a new kind of one-dimensional protein
structure that represents the extent of long-range contacts. We have recently
shown that a set of three types of one-dimensional structures (secondary
structure, contact number, and RWCO) contains sufficient information for
reconstructing the three-dimensional structure of proteins. Currently, there
exist prediction methods for secondary structure and contact number from amino
acid sequence, but none exists for RWCO. Also, the properties of amino acids
that affect RWCO are not clearly understood. Here, we present a linear
regression-based method to predict RWCO from amino acid sequence, and analyze
the regression parameters to identify the properties that correlate with the
RWCO. The present method achieves a significant correlation of 0.59 between the
native and predicted RWCOs on average. An unusual feature of the RWCO
prediction is the remarkably large optimal half window size of 26 residues. The
regression parameters for the central and near-central residues of the local
sequence segment correlate highly with those of the contact number prediction,
and hence with hydrophobicity.

Key words: protein structure prediction, residue-wise contact order, one-dimensional 
structure, linear regression. 
Introduction



One of the main goals of protein structure prediction is to provide an intuitive
picture of the relationship between the amino acid sequence and the native
three-dimensional (3D) structure of proteins. To this end, a number of methods
have been developed for ab initio or de novo protein structure prediction.
However, such methods are usually very complicated and make it difficult to
intuitively understand the relationship between amino acid sequence and 3D
structure. In this respect, one-dimensional (1D) structures of proteins may be
conventional intermediate representations of both sequence and structure of
proteins, since they make it easy to grasp the correspondence between sequence
and structural characteristics.

Since 1D structures are 3D structural features projected onto strings of
residue-wise structural assignments, a large part of the 3D information appears
to be lost. That is, the correspondence between amino acid sequence and 1D
structures does not seem to be sufficient for uncovering the correspondence
between amino acid sequence and 3D structure. However, Porto et al. have
recently shown that the contact matrix of a protein structure can be uniquely
recovered from its principal eigenvector. Since the protein 3D structure can be
recovered from the contact matrix, the result of Porto et al. indicates that
the information contained in the 3D structure can be expressed as a
one-dimensional representation. Furthermore, we have recently shown that the 3D
structure of proteins can be reconstructed from a set of three types of 1D
structures. In other words, the 3D structure of a protein is essentially
equivalent to a set of three types of 1D structures, namely secondary
structure, contact number, and residue-wise contact order. The fact that the 3D
structure of a protein can be recovered from a set of these 1D structures opens
a new possibility for elucidating the sequence-structure relationship of
proteins.

The secondary structure of a protein is a string of symbols representing α
helix, β strand, or coil. The contact number of each residue in a protein is
defined by the number of contacts the residue makes with other residues in the
protein. More precisely, if we represent the contact map of the protein by
$C_{i,j}$ ($C_{i,j} = 1$ if the i-th and j-th residues are in contact, or
$C_{i,j} = 0$ otherwise), the contact number $n_i$ of the i-th residue is
defined by $n_i = \sum_j C_{i,j}$. Similarly, the residue-wise contact order
(RWCO) $o_i$ of the i-th residue of a protein of length L is defined by
$o_i = \frac{1}{L} \sum_j |i - j| \, C_{i,j}$, that is, a sum of sequence
separations between the residue and the residues in contact with it, divided by
the sequence length. The contact order was first introduced as a per-protein
quantity by Plaxco et al. to study the correlation between protein topology and
folding rate. The RWCO introduced here is a generalization of the contact order,
and is a per-residue quantity.

At least in principle, if we can predict those 1D structures, we can also
construct the corresponding 3D structures. Many accurate methods have been
developed for secondary structure prediction. We have developed a method to
predict the contact number from amino acid sequence with an average correlation
of 0.63 between the native and predicted contact numbers. However, there is no
method for predicting RWCO from amino acid sequence to date, and it is not
clear if the prediction is possible at all. The primary objective of the present
paper is to develop a method to predict RWCO from amino acid sequence.

While the accurate prediction of structural properties is important for its own
sake, for a thorough understanding of the sequence-structure relationship, we
still need to identify the properties of the amino acid sequence that determine
the structure. From the vast number of studies on secondary structure
prediction in the past, we are now convinced that each amino acid has a
particular propensity for a particular secondary structure, although the final
secondary structures in the native structure are determined in the global
context. Also, contact number is closely related to the hydrophobicity of amino
acids. Thus, both secondary structure and contact number have clear connections
with the properties of amino acids. As for the residue-wise contact order, its
geometrical meaning is clear (i.e., a quantity related to the extent of
long-range contacts), but the conjugate properties of amino acids are not. As
the second objective of the present study, we attempt to identify the properties
of amino acids that affect RWCO by examining the parameters derived for the
prediction method.

The prediction method developed in this paper is based on a simple linear
regression scheme, which was also applied to the contact number prediction in
our previous study. By examining the regression parameters, we show that the
RWCO is primarily determined by the pattern of hydrophobicity of amino acids.
Although the method is extremely simple, it yields a significant correlation of
0.59 between the native and predicted RWCOs. While further refinement is
definitely necessary before the method can be applied to 3D structure
prediction, the present method will serve as a basis for more elaborate methods
yet to be developed.

Materials and Method 

Definition of residue-wise contact order 

As mentioned in the Introduction, the residue-wise contact order (RWCO) of the
i-th residue is defined by

$$o_i = \frac{1}{L} \sum_{j} |i - j| \, C_{i,j} \qquad (1)$$

where the summation is normalized by the length L of the amino acid sequence of
the protein and $C_{i,j}$ represents the contact map of the protein. We exclude
trivial contacts between nearest- and next-nearest residues along the sequence.
To make the RWCO useful for molecular dynamics simulations, the contact between
two residues is defined by a smooth sigmoid function:

$$C_{i,j} = \frac{1}{1 + \exp[w (r_{i,j} - d_c)]} \qquad (2)$$

where $r_{i,j}$ is the distance between the Cβ atoms of the i-th and j-th
residues (Cα atoms for glycine), $d_c$ is the cut-off distance for the contact
definition, and w is a parameter that determines the sharpness of the sigmoid
function. To be consistent with our previous studies, we set $d_c$ = 12 Å and
w = 3 throughout the present paper.
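As an illustration of Eqs. 1 and 2, the following is a minimal sketch, not the authors' implementation. It assumes the caller supplies one representative coordinate per residue (for example the atom choice described above); the function names and the `min_sep` argument are hypothetical.

```python
import math

def smooth_contact(r, d_c=12.0, w=3.0):
    """Sigmoid contact of Eq. 2: near 1 for r << d_c, near 0 for r >> d_c."""
    x = w * (r - d_c)
    if x > 700.0:  # guard against OverflowError in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(x))

def rwco(coords, d_c=12.0, w=3.0, min_sep=3):
    """Residue-wise contact order of Eq. 1 for a list of (x, y, z) points.

    Pairs with |i - j| < min_sep are skipped, which drops the residue itself
    and its nearest and next-nearest neighbours along the chain.
    """
    L = len(coords)
    values = []
    for i in range(L):
        total = 0.0
        for j in range(L):
            sep = abs(i - j)
            if sep < min_sep:
                continue
            r = math.dist(coords[i], coords[j])
            total += sep * smooth_contact(r, d_c, w)
        values.append(total / L)  # normalization by the chain length L
    return values

# Toy input: six points on a straight line, 3.8 Å apart (not a real protein).
toy = [(3.8 * k, 0.0, 0.0) for k in range(6)]
toy_rwco = rwco(toy)
```

Because the contact function is smooth, the resulting RWCO varies continuously with the coordinates, which is the property the text asks for in view of molecular dynamics applications.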

We also define the normalized (relative) RWCO by

$$y_i^p = \frac{o_i^p - \langle o^p \rangle}{\sqrt{\langle (o_i^p - \langle o^p \rangle)^2 \rangle}} \qquad (3)$$

where $\langle \cdot \rangle$ denotes the average over the given protein chain p.
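The normalization of Eq. 3 is a per-chain standardization. A minimal sketch follows, assuming the average is a plain mean over all residues of the chain (so the S.D. uses the population form); the function name is hypothetical.

```python
import math

def normalize_rwco(o):
    """Return y_i = (o_i - mean) / S.D. over one chain, as in Eq. 3."""
    n = len(o)
    mean = sum(o) / n
    # Population variance: the angle brackets are read as a plain average.
    var = sum((x - mean) ** 2 for x in o) / n
    sd = math.sqrt(var)
    if sd == 0.0:
        raise ValueError("RWCO is constant along the chain; Eq. 3 is undefined")
    return [(x - mean) / sd for x in o]

y_example = normalize_rwco([1.0, 2.0, 3.0, 4.0])
```

After this step every chain has zero mean and unit variance, so chains of very different size contribute on the same scale to the regression.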

Prediction scheme 

To predict the RWCO of each residue in a protein, we first conduct three
iterations of PSI-BLAST search against the NCBI non-redundant amino acid
sequence database to obtain the sequence profile of the protein with the E-value
cut-off of 10^. We use the amino acid score table of the PSI-BLAST profile,
which is represented as f(i, a) (i: site, a: amino acid) in the following
(instead of the frequency table used in the previous study).

The RWCO $o_i^p$ of the i-th residue in the protein p is predicted in two
steps. First we predict the normalized RWCO $\hat{y}_i^p$ for each residue, and
then we combine it with the mean and standard deviation (S.D.) of the RWCOs of
the protein, which are predicted separately. The normalized RWCO is predicted by
the following linear regression scheme:

$$\hat{y}_i^p = \sum_{m=-M}^{M} \sum_{a}^{\text{residue types}} C_{m,a} \, f^p(i + m, a) + C \qquad (4)$$

where M is the half window size (a free parameter to be determined),
$f^p(i + m, a)$ represents an element of the PSI-BLAST profile of the protein p,
and $C_{m,a}$ and C are regression parameters. Both amino and carboxyl termini
are treated by introducing an extra symbol for the "terminal residue." Thus, the
RWCO of the i-th residue is expressed as a linear function of the local sequence
of 2M + 1 residues surrounding the i-th residue.
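The windowed input of Eq. 4 can be pictured as one feature row per residue. The sketch below is an illustration under assumptions: the profile is a list of per-site score lists, and the "terminal residue" symbol is encoded as an extra column set to 1 for window positions that fall outside the chain. That encoding and the function name are hypothetical; the paper does not spell out the details.

```python
def window_features(profile, M, n_types=20):
    """Build one feature row per residue from a profile of per-site scores.

    profile[i][a] is the score of amino acid type a at site i. Each window
    position contributes n_types scores plus one "terminal" flag, and a
    trailing 1.0 is appended so that the constant C of Eq. 4 becomes an
    ordinary coefficient. Row length: (2M + 1) * (n_types + 1) + 1.
    """
    L = len(profile)
    terminal = [0.0] * n_types + [1.0]
    rows = []
    for i in range(L):
        row = []
        for m in range(-M, M + 1):
            k = i + m
            if 0 <= k < L:
                row.extend(list(profile[k]) + [0.0])  # inside the chain
            else:
                row.extend(terminal)                  # beyond either terminus
        row.append(1.0)
        rows.append(row)
    return rows

# Toy profile: 5 sites, every score at site i equal to i.
toy_profile = [[float(i)] * 20 for i in range(5)]
toy_rows = window_features(toy_profile, M=2)
```

With this layout, the dot product of a row with a coefficient vector reproduces the double sum of Eq. 4 plus the constant term.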
The values of $C_{m,a}$ and C are determined so as to minimize the prediction
error over a database of protein structures. The error function is defined by

$$E = \sum_{p} \sum_{i} (y_i^p - \hat{y}_i^p)^2 \qquad (5)$$

where $y_i^p$ is the observed normalized RWCO of the i-th residue of the protein
p and $\hat{y}_i^p$ is the corresponding predicted value. The minimization of E
can be achieved by the usual least squares method.
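Minimizing a sum of squared errors such as Eq. 5 is an ordinary least squares problem. The sketch below uses NumPy's `numpy.linalg.lstsq` as one possible solver; the paper only says "the usual least squares method", so the choice of solver and the function names are assumptions.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return coefficients minimizing sum((y - X @ coef) ** 2), as in Eq. 5.

    X stacks the feature rows of all residues of all training proteins;
    y stacks the matching observed normalized RWCOs.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # lstsq returns the minimum-norm solution if X is rank deficient.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict(X, coef):
    """Apply fitted coefficients to feature rows."""
    return np.asarray(X, dtype=float) @ coef

# Synthetic check: recover known coefficients from noise-free data.
rng = np.random.default_rng(0)
X_demo = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
true_coef = np.array([1.0, -2.0, 0.5, 3.0])
coef_demo = fit_least_squares(X_demo, X_demo @ true_coef)
```

For the window sizes discussed in the text the number of coefficients grows linearly with M, so the design matrix stays manageable for a direct solver.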

The mean ($\mu^p$) and standard deviation ($\sigma^p$) of the RWCOs of a protein
are predicted from the amino acid composition ($f_a^p$) and sequence length
($L^p$) of the protein p in the same manner as we have done for the contact
number prediction. That is, the mean and S.D. are predicted by the following
linear regression scheme:

$$\mu^p = \sum_{a} A_a f_a^p + A_l F(L^p) + A \qquad (6)$$

$$\sigma^p = \sum_{a} D_a f_a^p + D_l F(L^p) + D \qquad (7)$$

where $F(L^p) = L^p$ for $L^p < 300$ and $F(L^p) = 300$ for $L^p \geq 300$, and
$A_a$, $A_l$, A, $D_a$, $D_l$, and D are regression parameters. The final value
for the predicted absolute RWCO ($\hat{o}_i^p$) is given by

$$\hat{o}_i^p = \mu^p + \sigma^p \hat{y}_i^p \qquad (8)$$
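The second prediction step (Eqs. 6 to 8) can be sketched as follows. The coefficient values used in the example are placeholders chosen for easy arithmetic, not fitted parameters, and the function names are hypothetical.

```python
def capped_length(L, cap=300):
    """F(L) of Eqs. 6 and 7: the chain length, saturating at 300 residues."""
    return min(L, cap)

def linear_in_composition(composition, L, coef_a, coef_len, const):
    """Shared form of Eqs. 6 and 7.

    Computes sum_a coef_a[a] * f_a + coef_len * F(L) + const, where
    composition[a] is the fraction of amino acid type a in the chain.
    """
    comp_term = sum(c * f for c, f in zip(coef_a, composition))
    return comp_term + coef_len * capped_length(L) + const

def absolute_rwco(y_hat, mu, sigma):
    """Eq. 8: rescale predicted normalized RWCOs by the predicted mean and S.D."""
    return [mu + sigma * y for y in y_hat]

# Placeholder numbers for a two-letter toy alphabet (illustration only).
mu_demo = linear_in_composition([0.5, 0.5], 100, [2.0, 4.0], 0.01, 1.0)
o_demo = absolute_rwco([-1.0, 0.0, 1.0], mu=8.0, sigma=2.0)
```

Separating the per-chain scale (mean and S.D.) from the per-residue shape keeps each regression small, at the cost that errors in the predicted mean or S.D. shift or stretch every residue of the chain at once.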

Data set 

We first selected representative proteins from each superfamily of the all-α,
all-β, α/β, α+β, and multi-domain classes of the SCOP (version 1.65) protein
structure classification database through the ASTRAL database. Structures that
were present in this superfamily representative set but absent from the 40%
representative set of ASTRAL, those containing chain breaks (except at the
termini), and those with an average contact number of less than 7.5
(non-compact structures) were discarded. Non-standard amino acid residues were
converted to the corresponding standard residues when possible, and otherwise
discarded. When atoms were absent in non-glycine residues, they were modeled by
the SCWRL side-chain prediction program. In the end, 680 protein chains
remained. The list of this data set will be available from the author's website.

For training the parameters and testing the prediction accuracy, we performed
a 15-fold cross-validation test. The 680 proteins were randomly divided into two
groups: a training set of 630 proteins for training the parameters, and a test
set of 50 proteins for testing the prediction using the parameters obtained from
the training set. The procedure was repeated 15 times.
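One literal reading of this procedure is a repeated random 630/50 split. The sketch below follows that reading; whether the 15 test sets were allowed to overlap is not stated in the text, so independent redraws are an assumption, as are the function name and the seed.

```python
import random

def random_splits(chain_ids, n_test=50, n_rounds=15, seed=0):
    """Yield (train, test) lists; each round redraws the test set at random."""
    rng = random.Random(seed)
    ids = list(chain_ids)
    for _ in range(n_rounds):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        # First n_test shuffled chains are held out, the rest are for training.
        yield shuffled[n_test:], shuffled[:n_test]

# Integer placeholders stand in for the 680 chain identifiers.
splits = list(random_splits(range(680)))
```

In each round the regression parameters would be fitted on the 630 training chains only and evaluated on the 50 held-out chains.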

Measures of prediction accuracy 

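We employ two measures for evaluating the prediction accuracy. The first one is
the correlation coefficient ($Cor_p$) between the observed and predicted RWCOs
for a given protein p, which is defined by

$$Cor_p = \frac{\langle (o_i^p - \langle o^p \rangle)(\hat{o}_i^p - \langle \hat{o}^p \rangle) \rangle}{\sqrt{\langle (o_i^p - \langle o^p \rangle)^2 \rangle \, \langle (\hat{o}_i^p - \langle \hat{o}^p \rangle)^2 \rangle}} \qquad (9)$$

where $o_i^p$ and $\hat{o}_i^p$ are the observed and predicted RWCOs,
respectively. The $Cor_p$ measures the consistency of the normalized RWCOs. In
order to measure the accuracy of the predicted absolute values, we use the RMS
error divided by the standard deviation of the observed RWCO ($DevA_p$):

$$DevA_p = \frac{\sqrt{\langle (o_i^p - \hat{o}_i^p)^2 \rangle}}{\sqrt{\langle (o_i^p - \langle o^p \rangle)^2 \rangle}} \qquad (10)$$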
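The two accuracy measures can be computed per chain as sketched below. This follows the verbal definitions in the text (a correlation coefficient, and the RMS error divided by the S.D. of the observed values); the function names are hypothetical.

```python
import math

def _mean(v):
    return sum(v) / len(v)

def cor_p(obs, pred):
    """Pearson correlation between observed and predicted RWCOs of one chain."""
    mo, mp = _mean(obs), _mean(pred)
    cov = _mean([(o - mo) * (p - mp) for o, p in zip(obs, pred)])
    so = math.sqrt(_mean([(o - mo) ** 2 for o in obs]))
    sp = math.sqrt(_mean([(p - mp) ** 2 for p in pred]))
    return cov / (so * sp)

def dev_a(obs, pred):
    """RMS error divided by the S.D. of the observed RWCOs."""
    mo = _mean(obs)
    rms = math.sqrt(_mean([(o - p) ** 2 for o, p in zip(obs, pred)]))
    so = math.sqrt(_mean([(o - mo) ** 2 for o in obs]))
    return rms / so

obs_demo = [1.0, 2.0, 3.0, 4.0]
```

As a reading aid for the scale of the second measure: a constant prediction equal to the observed chain mean gives a value of exactly 1 by construction, and a perfect prediction gives 0.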

Results 

Optimal window size 

In the prediction scheme presented in this paper, the half window size M is a
free parameter. We determine its value so that the prediction accuracy is
maximized. We have performed a 15-fold cross-validation test with M ranging from
0 to 40. The result is summarized in Figure 1. The correlation coefficient
$Cor_p$ (averaged over the test sets) ranges from 0.48 at M = 0 to 0.59 at
M = 26 (Figure 1A). It should be noted that the correlation of 0.48 is already
statistically significant given the average sequence length (172 residues) of
the proteins in the data set. The value of $Cor_p$ monotonically increases from
M = 0 to M = 26, but starts to saturate for M > 20 and decreases slowly for
M > 26. The deviation $DevA_p$ (averaged over the test sets) shows a consistent
trend with $Cor_p$ (Figure 1B), and it reaches the minimum value of ≈ 1.03 at
M = 26. Thus, the optimal window size has been determined to be M = 26.

This optimal window size of M = 26 is much larger than the ones for any
other 1D structure predictions. As far as we are aware, this is the longest
range of correlation observed between 1D structure and amino acid sequence. For
example, the optimal half window size is M = 9 for contact number prediction
(see below) and M = 6 to 8 for secondary structure prediction. Large window
sizes usually result in over-fitting the training data, but such is not the case
for RWCO prediction, as we have performed cross-validation tests. This unusually
long-range correlation with amino acid sequence is a conspicuous property of the
RWCO.

Distribution of correlation 

As indicated by the average values of $Cor_p$ and $DevA_p$, the linear
regression method with M = 26 tends to produce more accurate predictions than
with other window sizes. However, the prediction accuracies for individual
proteins differ significantly, as shown in Figure 2. While most of the proteins
are decently predicted with correlations of 0.5 or higher, some proteins exhibit
very poor correlations. The poorly predicted proteins are found not to be well
packed, owing to the small size of the protein (e.g., SCOP domain d1fs1a1), a
large fraction of structurally disordered regions (e.g., d1cpo_1), or being a
subunit of a large complex (e.g., d1mtyg_).
The prediction accuracy does not strikingly differ depending on the structural
class of proteins (Table 1). However, all-α proteins show slightly poorer
correlations compared to other classes, and α+β proteins show relatively better
correlations. The latter may be due to the over-dominance of the α+β proteins in
the data sets.

In Figure 3, three examples of predicted RWCO are shown. Despite the relatively
good correlation between the native and predicted RWCOs, the absolute values of
predicted RWCOs at many sites significantly differ from the corresponding native
RWCOs. This behavior is indicated by the relatively large value of
$DevA_p$ ≈ 1.03 (Figure 1B). In particular, we notice that RWCOs of large values
are consistently underestimated. This behavior suggests that some cooperative
effects should be taken into account for better prediction. Given that the
present method is based exclusively on one-body terms (Eq. 4), the prediction
accuracy achieved is satisfactory, at least qualitatively.

Regression parameters as functions of sequence position 

Since the present study is the first attempt to develop a prediction method for
RWCO, it is of interest to examine the properties of amino acid residues that
affect the RWCO, which are reflected in the values of the regression
coefficients $C_{m,a}$. Figure 4 shows the values of $C_{m,a}$ for each amino
acid type a as a function of the window position m. For all the amino acid
types, the peak of $C_{m,a}$, when present, is at the center (m = 0). We can
easily recognize that these values, those at m = 0 in particular, are related to
the hydrophobicity of amino acids. That is, $C_{0,a} > 0$ for hydrophobic
residues and $C_{0,a} < 0$ for hydrophilic residues. When the amino acid index
(AAindex) database was scanned for indices that correlate highly with $C_{0,a}$,
we found various hydrophobicity scales whose correlations with $C_{0,a}$ exceed
0.90 (data not shown). Therefore, we can conclude that the RWCO is primarily
determined by the pattern of hydrophobicity along the sequence.
Some amino acid types exhibit oscillation with the periodicity of 3 to 4
residues, which is expected for the α helix. In fact, such residues (e.g., GLU,
GLN, ALA, etc.) are of high α helix propensity. On the contrary, the residues of
high β strand propensity (e.g., ILE, VAL, etc.) do not exhibit such oscillation.
Therefore, in addition to the hydrophobic properties, the parameters for RWCO
also contain information for secondary structures.

Discussion 

Comparison with contact number prediction 

As can be expected from their definitions, the native RWCOs and contact numbers
show a high correlation of 0.7 (data not shown). This is also consistent with
the finding that RWCOs are primarily determined by hydrophobicity. Because
of the correlation between RWCO and contact number, it is of interest to ask
whether it is possible to "predict" RWCOs using contact number prediction, and
vice versa. The result of this "cross-prediction" is listed in Table 2. Here,
the contact number prediction is based on exactly the same linear regression
scheme as the RWCO prediction method. To make the quality of the two prediction
methods comparable, we have determined the regression parameters and the optimal
half window size for the contact number prediction using the same training and
test data sets as used here. The resulting contact number prediction method
yields an average prediction accuracy of $Cor_p$ ≈ 0.70 and $DevA_p$ ≈ 0.803
with the optimal half window size of 9 (Table 2, Case B), a remarkable
improvement over our previous study ($Cor_p$ ≈ 0.63 and $DevA_p$ ≈ 0.941), which
is likely to be due to the use of PSI-BLAST score profiles (we used frequency
profiles derived from the HSSP database in the previous study). When the values
obtained from the contact number prediction are compared to the native RWCOs,
the highest correlation is 0.50 with the optimal half window size of M = 4
(Table 2, Case C). Although the correlation of 0.50 is statistically
significant, the value is much lower than the one obtained for the proper
prediction of RWCO, $Cor_p$ ≈ 0.59 (Table 2, Case A). For the "prediction" in
the opposite direction, that is, when the values obtained from the RWCO
prediction are compared to the native contact numbers, the correlation is as
high as 0.62 with the optimal half window size of M = 4 (Table 2, Case D).
Again, this value, though statistically significant, is lower than that of the
proper contact number prediction ($Cor_p$ ≈ 0.70). Interestingly, for Cases C
and D in Table 2, the optimal half window sizes coincide (M = 4). Therefore, it
is expected that the contact number and RWCO are very closely related to each
other in terms of the short-range pattern of the local amino acid sequence. In
other words, the distinction between the contact number and RWCO originates from
the interactions of longer range.

To further clarify the correlation between RWCO and contact number predictions,
we compared the regression parameters $C_{m,a}$ for RWCO and contact number
predictions up to the half window size of M = 9 (Figure 5). It can be clearly
seen that the two sets of regression parameters correlate very significantly
(correlation of > 0.7) with each other within the window positions of
-4 ≤ m ≤ 4 (Figure 5), which confirms the above observation (Table 2, Cases C
and D).
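The per-position comparison of the two parameter sets can be sketched as below. The storage layout (one row per window position, one entry per amino acid type, offset by the half window size of each predictor) and the function names are assumptions made for illustration, and the numbers are toy values rather than fitted parameters.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def per_position_correlation(c_rwco, c_cn, M_rwco, M_cn, M_max):
    """Correlate the two parameter tables at each window position m.

    c_rwco[m + M_rwco][a] and c_cn[m + M_cn][a] hold C_{m,a} for the two
    predictors; the result maps m to the correlation over amino acid types.
    """
    return {
        m: pearson(c_rwco[m + M_rwco], c_cn[m + M_cn])
        for m in range(-M_max, M_max + 1)
    }

# Toy tables with three "amino acid types" (illustration only).
toy_rwco = [[9, 1, 4], [2, 4, 6], [0, 3, 0], [-3, -1, -2], [5, 5, 1]]  # m = -2..2
toy_cn = [[1, 2, 3], [0, 1, 0], [3, 1, 2]]                             # m = -1..1
toy_corr = per_position_correlation(toy_rwco, toy_cn, M_rwco=2, M_cn=1, M_max=1)
```

Plotting such a dictionary against m would give a curve of the kind described for the comparison of the two predictors.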

Perspective for improving prediction accuracy 

The method for predicting RWCOs from amino acid sequence developed in this
paper is a very primitive one. While the correlation of 0.59 between the native
and predicted RWCOs is significant, it is not as high as the 0.70 obtained for
the contact number prediction (Table 2) based on the same linear regression
scheme. Furthermore, the agreement of absolute RWCO values is relatively poor,
especially for RWCOs of large values. As mentioned above, inclusion of many-body
effects seems mandatory for better RWCO prediction. A popular method for dealing
with many-body terms is artificial neural networks. Other non-linear regression
schemes such as radial basis or support vector regressions can also be
applicable. Neural network methods as well as a support vector regression method
have been successfully applied to real value prediction of solvent
accessibility. Solvent accessibility is closely related to the hydrophobicity of
amino acids, and hence is likely to be related to the RWCO. Thus, we can expect
that such non-linear regression approaches may also be useful for predicting
RWCO. However, since the RWCO prediction requires a rather long segment of local
amino acid sequence (half window size of M = 26), straightforward application of
non-linear regression methods requiring a great number of parameters may not
work. The number of parameters must be somehow reduced. How to extract essential
parameters for RWCO prediction is left for future studies.

An alternative route to improved accuracy is to properly treat the large
deviation of RWCOs along the amino acid sequence. For the contact number,
its average over a local segment tends to be close to the average over the whole
sequence, whereas, for the RWCO, such is not the case. For example, for the
SCOP domain d1a9xb1 (Figure 3C), the average contact numbers for the whole
domain, for residues 1 to 20, and for residues 51 to 70 are, respectively, 25.5,
28.4, and 26.6, whereas the corresponding averages of the RWCOs are 8.0, 14.3,
and 4.9, respectively. Since the present method is based on the globally
normalized RWCO (Eq. 3), such large deviations are difficult to handle. If this
limitation is overcome, better prediction accuracy may be obtained.

Table 1: Distribution of Cor_p for each SCOP class.*

Range of                        SCOP class‡
Cor_p†        a          b          c          d           e
(-1, 0.2]     4 (3)      1 (0.6)    7 (4)      2 (0.8)     -
(0.2, 0.4]    23 (14)    17 (10)    14 (8)     22 (9)      1 (5)
(0.4, 0.6]    61 (38)    54 (33)    55 (33)    72 (30)     11 (61)
(0.6, 0.8]    73 (45)    86 (52)    82 (49)    136 (57)    6 (33)
(0.8, 1.0]    1 (0.6)    6 (4)      8 (5)      8 (3)       -
total         162        164        166        240         18

* The number (percentage in parentheses) of occurrences of Cor_p for the
proteins in the test sets, classified according to the SCOP database.
† The range "(x, y]" denotes x < Cor_p ≤ y.
‡ a: all-α, b: all-β, c: α/β, d: α+β, e: multi-domain.

Table 2: Cross-prediction between residue-wise contact orders and contact
numbers.

Case    Train(a)    Test(b)    M(c)    Cor_p    DevA_p
A       RWCO        RWCO       26      0.59     1.03
B       CN          CN         9       0.70     0.803
C       CN          RWCO       4       0.50     N.A.(d)
D       RWCO        CN         4       0.62     N.A.(d)

(a) Target values for which the regression parameters were trained. "RWCO" and
"CN" indicate that the regression parameters were trained to fit the
residue-wise contact orders and contact numbers, respectively.
(b) Target values for which the "prediction" was applied. "RWCO" and "CN"
indicate that predicted values were compared with the native residue-wise
contact orders and native contact numbers, respectively.
(c) Optimal half window size for the prediction.
(d) Not applicable because the ranges of RWCO and CN values are different.

Figure 1: Prediction accuracy as a function of window size. (A) The correlation
coefficient (Cor_p) between the native and predicted RWCO, averaged over the
test set proteins. (B) Deviation of the predicted RWCO from the native one
(DevA_p), averaged over the test set proteins.

Figure 2: Cor_p plotted against chain length. Each point represents a protein in
one of the test sets.

Figure 3: Examples of prediction. Red: native RWCO; Green: predicted RWCO.
(A) SCOP domain d1a6m__ (myoglobin, all-α), Cor_p = 0.73, DevA_p = 0.75;
(B) SCOP domain d1ifra_ (Lamin A/C globular tail domain, all-β), Cor_p = 0.72,
DevA_p = 0.87; (C) SCOP domain d1a9xb1 (Carbamoyl phosphate synthetase,
small subunit N-terminal domain, α/β), Cor_p = 0.72, DevA_p = 0.81.

Figure 4: C_{m,a} for each amino acid type (a) as a function of the window
position (m).

Figure 5: Correlation between the regression parameters C_{m,a} for contact
number and RWCO predictions for each window position. The horizontal axis is the
window position m in the local sequence. The vertical axis is the correlation
coefficient between the regression parameters C_{m,a} for RWCO prediction and
those for contact number prediction at the window position m.